@wine99 (Collaborator) commented Oct 20, 2025

llama-server runs with default params (does not support -np >1)
llama-bench runs with -fa 1

YangleiZouIntel and others added 30 commits October 15, 2025 13:03
…e model

 * Add OpenVINO ADD operator to Llama.cpp. The output is somewhat abnormal and needs further debugging.
@github-actions github-actions bot added the ggml label Oct 20, 2025
@wine99 wine99 force-pushed the reset_variable_state branch from abf2454 to 94934fa Compare October 21, 2025 03:21
@wine99 wine99 force-pushed the reset_variable_state branch from 94934fa to 8f93fe6 Compare October 21, 2025 05:32
@wine99 wine99 force-pushed the reset_variable_state branch from 8f93fe6 to 5f25e52 Compare October 21, 2025 06:59
@cavusmustafa (Collaborator) left a comment

I wonder if there is a way to identify different sequences other than reading the kv_cache state. If there is, how about creating a unique inference request for each sequence? Instead of caching inference requests by cgraph pointer, we could build the cache key from cgraph + seq_id. If that is possible, we would not need to read the states at all, since each inference request would manage its own variables.
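A minimal sketch of the keying idea in this comment, under stated assumptions: `FakeGraph`, `FakeInferRequest`, and `RequestCache` are hypothetical stand-ins (not types from the PR or from OpenVINO), and it assumes a `seq_id` is available at the point where the request is looked up, which this thread does not confirm.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-ins: a real backend would use its graph type and an
// actual inference request object here.
struct FakeGraph {};
struct FakeInferRequest {
    int kv_len = 0;  // placeholder for per-request state
};

// Cache key built from the graph address plus the sequence id.
using RequestKey = std::pair<std::uintptr_t, std::int32_t>;

class RequestCache {
public:
    // Returns the request for (cgraph, seq_id), creating it on first use.
    FakeInferRequest & get_or_create(const FakeGraph * cgraph, std::int32_t seq_id) {
        const RequestKey key{reinterpret_cast<std::uintptr_t>(cgraph), seq_id};
        auto & slot = cache_[key];
        if (!slot) {
            slot = std::make_unique<FakeInferRequest>();
        }
        return *slot;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::map<RequestKey, std::unique_ptr<FakeInferRequest>> cache_;
};
```

With this layout two sequences that share a graph get separate request objects, so neither needs to inspect the other's state. The trade-off is one request (and its state) held per live sequence.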

break;
}
}
static std::string device = getenv("GGML_OPENVINO_DEVICE") ? getenv("GGML_OPENVINO_DEVICE") : "CPU";

How about doing something like this to avoid multiple getenv calls?

#include <cstdlib>
#include <string>

// Read GGML_OPENVINO_DEVICE once; fall back to "CPU" when it is unset.
const std::string & getDevice() {
    static const std::string device_str = [] {
        const char * device_env = std::getenv("GGML_OPENVINO_DEVICE");
        return device_env ? std::string(device_env) : std::string("CPU");
    }();
    return device_str;
}
.
.
.
static std::string device = getDevice();

ov::AnyMap config;
if (device == "GPU") {
auto * disable_sdpa_optimization = getenv("GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION");
if (disable_sdpa_optimization && std::string(disable_sdpa_optimization) != "0") {

Why do we need to check this for every iteration instead of checking once before compile_model?
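One way to do the check once, sketched under assumptions: the helper names `parse_flag` and `sdpa_optimization_disabled` are hypothetical and not part of the PR. The environment variable name and the "set and not equal to 0" rule are taken from the quoted diff lines.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: a flag counts as enabled when it is set and not "0",
// mirroring the condition in the quoted diff.
inline bool parse_flag(const char * value) {
    return value != nullptr && std::string(value) != "0";
}

// Reads the environment variable on the first call only; later calls return
// the cached result, so the lookup does not repeat on every iteration.
inline bool sdpa_optimization_disabled() {
    static const bool disabled =
        parse_flag(std::getenv("GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION"));
    return disabled;
}
```

The cached value could then be consulted once while building the config that is passed to compile_model. A side effect of caching is that changing the variable after the first call has no effect.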


// outdated if:
// 1. kv_len != kv_len_in_state
// 2. last row has different values

In this case are we deleting the previous kv_cache completely?
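For reference, the two conditions in the quoted code comment could be sketched as below. This is an illustration only: `KvSnapshot` and `is_state_outdated` are hypothetical names, the row-major float layout is an assumption, and the sketch does not show what the PR does once the state is found to be outdated.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical row-major snapshot: kv_len rows of row_size floats each.
// Assumes data.size() == kv_len * row_size.
struct KvSnapshot {
    std::size_t kv_len;
    std::size_t row_size;
    std::vector<float> data;
};

// Returns true when the stored state no longer matches the expected cache:
//   1. the lengths differ, or
//   2. the last row holds different values.
inline bool is_state_outdated(const KvSnapshot & expected, const KvSnapshot & in_state) {
    if (expected.kv_len != in_state.kv_len) {
        return true;
    }
    if (expected.kv_len == 0) {
        return false;  // both empty, nothing to compare
    }
    if (expected.row_size != in_state.row_size) {
        return true;
    }
    const std::size_t offset = (expected.kv_len - 1) * expected.row_size;
    const auto first = expected.data.begin() + static_cast<std::ptrdiff_t>(offset);
    const auto last = first + static_cast<std::ptrdiff_t>(expected.row_size);
    // Exact float comparison: any difference in the last row counts as outdated.
    return !std::equal(first, last, in_state.data.begin() + static_cast<std::ptrdiff_t>(offset));
}
```

Comparing only the last row is cheap, but it would miss a difference confined to earlier rows when the lengths happen to match.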

@wine99 wine99 force-pushed the dev_backend_openvino branch from 956dbf7 to d5038aa Compare November 4, 2025 08:51
@wine99 wine99 closed this Nov 6, 2025
8 participants